-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvement when counting num spikes #2198
Conversation
Spike vector should be the default, as computing 1 spike train will mean the cache is available (but doesn't mean it's faster to use the cached spike trains)
Have you profiled it? In the spike trains approach, you just count arrays in memory. With the spike vector approach you need a |
The problem is that if you called
|
Which "old" version? If they are all cached it should still be faster, but I agree that they might not be all cached and that unique is fast. So good for me ;) |
By "old" I meant current (not this PR) |
@DradeAW : I will build on top of this PR a option to return a dict or a numpy.array for counting spikes. |
See this PR which is built on top of this one : #2209 |
Yep makes sense! Also another useful option would be to return per segment or the sum over all segments? |
I agree with @alejoe91 that you should profile this.
Really optimization is something that should not be based on guesses or expected behavior. |
Here is a benchmark for a big soring ( import time
import numpy as np
import spikeinterface.core as si
sorting = si.load_extractor("/mnt/raid0/data/MEArec/1h_3000cells/analyses/ks2_5_pj7-3/sorting")
N = 100
time_spike_trains = np.empty(N, dtype=np.float32)
time_spike_vector = np.empty(N, dtype=np.float32)
# Cache spike vector and spike trains
sorting.to_spike_vector(use_cache=True)
for unit_id in sorting.unit_ids:
sorting.get_unit_spike_train(unit_id, use_cache=True)
# Benchmark
for i in range(N):
t1 = time.perf_counter()
sorting.count_num_spikes_per_unit(method="spike_trains") # I added the argument 'method' to benchmark
t2 = time.perf_counter()
sorting.count_num_spikes_per_unit(method="spike_vector")
t3 = time.perf_counter()
time_spike_trains[i] = t2 - t1
time_spike_vector[i] = t3 - t2
print(f"Spike trains: {1e3*np.mean(time_spike_trains):.1f} +- {1e3*np.std(time_spike_trains):.1f} ms")
print(f"Spike vector: {1e3*np.mean(time_spike_vector):.1f} +- {1e3*np.std(time_spike_vector):.1f} ms") Output:
So a clear difference, but also clearly under a second for each. |
The diff on git made this very confusing to me. You changed the method to use the Also, why did you change the else branch? |
I didn't change the else, I just switched which one came first. The idea is that, right now, I don't know how to check whether all the spike trains have been computed, and the computation is only faster if you have all of them (as shown in my other PR where computing spike trains is very slow) If you don't have the spike vector to begin with then it default to looking at the spike trains, which is the correct behaviour |
The #2209 is now doing a better choice on top of this one. |
Agreed, I'll close this PR :) |
Spike vector should be the default, as computing 1 spike train will mean the cache is available (but doesn't mean it's faster to use the cached spike trains)